This report describes how we built a machine learning model that classifies the way a weight lifting exercise (unilateral dumbbell biceps curl) was performed into one of five classes, A to E (the classe variable).
More about the research and data used can be found on the following website: http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises#ixzz6CzLP0YxO
Our best model (Random Forest) reached an accuracy of about 97.4% on a validation dataset held out from the training data.
Our data source URLs:
TRAINING_SOURCE_FILE_URL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
TESTING_SOURCE_FILE_URL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
Loading and splitting the data (training, validating and testing):
NA_STRINGS <- c("NA","#DIV/0!")
training <- read.csv(TRAINING_FILE_PATH, na.strings = NA_STRINGS)
testing <- read.csv(TESTING_FILE_PATH, na.strings = NA_STRINGS)
in.training <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
validating <- training[-in.training, ]
training <- training[in.training, ]
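The split above is random, so the row counts and accuracies in this report will vary slightly between runs unless a seed is set first. The sketch below illustrates the idea of a stratified 70/30 split in base R on toy data. It is not caret's implementation: stratified_split is a hypothetical helper, and the toy data frame and the seed value are assumptions made only for the illustration.

```r
# Hypothetical helper: pick a fraction p of the row indices within each class,
# so that class proportions are roughly preserved in both parts of the split.
stratified_split <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    n_take <- round(p * length(idx))
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1234)  # arbitrary seed, only there to make the split reproducible

# Toy data standing in for the real training set.
toy <- data.frame(
  x = rnorm(100),
  classe = factor(rep(c("A", "B"), times = c(60, 40)))
)

in.train <- stratified_split(toy$classe, p = 0.7)
toy.training <- toy[in.train, ]
toy.validating <- toy[-in.train, ]

table(toy.training$classe)
table(toy.validating$classe)
```

Calling set.seed before createDataPartition in the real pipeline would have the same effect of making the partition repeatable.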
What our original training dataset looks like:
dim(training)
## [1] 13737 160
print(table(training$classe))
##
## A B C D E
## 3906 2658 2396 2252 2525
Checking the presence of NAs per variable:
na.stats
##
## (-0.001,0.05] (0.95,1]
## 60 100
A large number of variables (100 of 160) have more than 95% NA values. These variables will be ignored in our models.
We will also remove other variables that should not be related to the response, such as timestamps and the name of the athlete.
unwanted.columns <- c("X", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "new_window", "num_window", "user_name")
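The code that computes the NA share per variable and drops the columns is not shown in this report. A minimal sketch of how it could be done is given below, on toy data. The toy column values and the 0.95 cut-off expressed as a strict inequality are assumptions, chosen to match the (0.95,1] bin printed above.

```r
set.seed(1)

# Toy data standing in for the real training set.
toy <- data.frame(
  X = 1:40,                               # row counter, unrelated to the response
  roll_belt = rnorm(40),                  # fully observed sensor reading
  max_roll_belt = c(1.5, rep(NA, 39)),    # 97.5% missing
  classe = factor(rep(c("A", "B"), 20))
)

# Share of missing values in every column.
na.fraction <- colMeans(is.na(toy))

# Columns that are almost entirely NA.
almost.empty.columns <- names(na.fraction)[na.fraction > 0.95]

# Columns removed by hand (only X exists in the toy data).
toy.unwanted.columns <- c("X")

# Keep everything that is in neither list.
drop <- names(toy) %in% c(almost.empty.columns, toy.unwanted.columns)
toy.clean <- toy[, !drop, drop = FALSE]

names(toy.clean)
```

The same column selection has to be applied to the validation and testing data so that all three datasets share the same predictors.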
You can see plots and more details about the training dataset in the Appendix.
New dimensions of our training dataset:
dim(training)
## [1] 13737 53
PCA:
model.pca <- preProcess(training, method = "pca", thresh = 90/100)
training.pcs <- predict(model.pca, training)
validating.pcs <- predict(model.pca, validating)
Number of columns in our training dataset when only the principal components are kept:
ncol(select(training.pcs, -classe))
## [1] 18
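The thresh argument asks for enough components to retain 90% of the variance. The sketch below shows the same idea with base R's prcomp on a built-in dataset. It is an illustration under assumptions, not caret's internal code: mtcars is only a stand-in for the sensor columns, and the object names are made up for the example.

```r
# Built-in numeric data used purely as a stand-in for the sensor columns.
x <- mtcars

# Centre and scale every column, then rotate onto the principal components.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Cumulative share of the total variance explained by the first k components.
cum.var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest k whose cumulative share reaches the 90% threshold.
n.comp <- which(cum.var >= 0.90)[1]

# Scores of the retained components, one row per observation.
x.pcs <- pca$x[, seq_len(n.comp), drop = FALSE]

n.comp
dim(x.pcs)
```

As in the report, the rotation should be estimated on the training data only and then applied unchanged to the validation and testing data (with prcomp this is done through predict(pca, newdata)).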
We are training 2 models: Random Forest and GBM.
model.rf <- train(classe ~ ., method = "rf", data = training.pcs, trControl = trainControl(method = "cv", number = 3))
model.gbm <- train(classe ~ ., method = "gbm", data = training.pcs, verbose = FALSE)
Accuracy of our models on the training dataset:
## [1] "Random Forest Accuracy = 1"
## [1] "GBM Accuracy = 0.838174273858921"
Now, checking the accuracy on the validation dataset:
## [1] "Random Forest Accuracy = 0.974001699235344"
## [1] "GBM Accuracy = 0.7928632115548"
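The code that produced the accuracy lines is not included in this report. A minimal sketch of the calculation is shown below on toy labels. The vectors are invented for the illustration; in the real pipeline they would be the true classe values and the output of predict on the corresponding principal components.

```r
# Toy labels standing in for the true classes and the model predictions.
actual    <- factor(c("A", "A", "B", "B", "C", "C"), levels = c("A", "B", "C"))
predicted <- factor(c("A", "B", "B", "B", "C", "A"), levels = c("A", "B", "C"))

# Accuracy is the share of predictions that match the true class.
accuracy <- mean(predicted == actual)

# The cross-tabulation shows which classes are confused with each other;
# correct predictions sit on the diagonal.
confusion <- table(Predicted = predicted, Actual = actual)

print(paste("Toy Accuracy =", accuracy))
print(confusion)
```

The Random Forest accuracy of 1 on the training data is an optimistic figure, because the model has already seen those rows. The validation accuracy is the better estimate of how the model behaves on new data.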
Since the Random Forest model already achieves very high accuracy, no stacking/ensemble will be performed.
testing.pcs <- predict(model.pca, testing)
result <- predict(model.rf, newdata = testing.pcs)
result
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
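We were able to build an accurate model for our classification problem: the Random Forest trained on 18 principal components reached about 97.4% accuracy on the validation dataset.
Variables that are being ignored: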
unwanted.columns
## [1] "X" "raw_timestamp_part_1" "raw_timestamp_part_2"
## [4] "cvtd_timestamp" "new_window" "num_window"
## [7] "user_name"
almost.empty.columns
## [1] "kurtosis_yaw_belt" "skewness_yaw_belt"
## [3] "kurtosis_yaw_dumbbell" "skewness_yaw_dumbbell"
## [5] "kurtosis_yaw_forearm" "skewness_yaw_forearm"
## [7] "kurtosis_picth_forearm" "skewness_pitch_forearm"
## [9] "kurtosis_roll_forearm" "skewness_roll_forearm"
## [11] "max_yaw_forearm" "min_yaw_forearm"
## [13] "amplitude_yaw_forearm" "kurtosis_picth_arm"
## [15] "skewness_pitch_arm" "kurtosis_roll_arm"
## [17] "skewness_roll_arm" "kurtosis_picth_belt"
## [19] "skewness_roll_belt.1" "kurtosis_yaw_arm"
## [21] "skewness_yaw_arm" "kurtosis_roll_belt"
## [23] "skewness_roll_belt" "max_yaw_belt"
## [25] "min_yaw_belt" "amplitude_yaw_belt"
## [27] "kurtosis_roll_dumbbell" "skewness_roll_dumbbell"
## [29] "max_yaw_dumbbell" "min_yaw_dumbbell"
## [31] "amplitude_yaw_dumbbell" "kurtosis_picth_dumbbell"
## [33] "skewness_pitch_dumbbell" "max_roll_belt"
## [35] "max_picth_belt" "min_roll_belt"
## [37] "min_pitch_belt" "amplitude_roll_belt"
## [39] "amplitude_pitch_belt" "var_total_accel_belt"
## [41] "avg_roll_belt" "stddev_roll_belt"
## [43] "var_roll_belt" "avg_pitch_belt"
## [45] "stddev_pitch_belt" "var_pitch_belt"
## [47] "avg_yaw_belt" "stddev_yaw_belt"
## [49] "var_yaw_belt" "var_accel_arm"
## [51] "avg_roll_arm" "stddev_roll_arm"
## [53] "var_roll_arm" "avg_pitch_arm"
## [55] "stddev_pitch_arm" "var_pitch_arm"
## [57] "avg_yaw_arm" "stddev_yaw_arm"
## [59] "var_yaw_arm" "max_roll_arm"
## [61] "max_picth_arm" "max_yaw_arm"
## [63] "min_roll_arm" "min_pitch_arm"
## [65] "min_yaw_arm" "amplitude_roll_arm"
## [67] "amplitude_pitch_arm" "amplitude_yaw_arm"
## [69] "max_roll_dumbbell" "max_picth_dumbbell"
## [71] "min_roll_dumbbell" "min_pitch_dumbbell"
## [73] "amplitude_roll_dumbbell" "amplitude_pitch_dumbbell"
## [75] "var_accel_dumbbell" "avg_roll_dumbbell"
## [77] "stddev_roll_dumbbell" "var_roll_dumbbell"
## [79] "avg_pitch_dumbbell" "stddev_pitch_dumbbell"
## [81] "var_pitch_dumbbell" "avg_yaw_dumbbell"
## [83] "stddev_yaw_dumbbell" "var_yaw_dumbbell"
## [85] "max_roll_forearm" "max_picth_forearm"
## [87] "min_roll_forearm" "min_pitch_forearm"
## [89] "amplitude_roll_forearm" "amplitude_pitch_forearm"
## [91] "var_accel_forearm" "avg_roll_forearm"
## [93] "stddev_roll_forearm" "var_roll_forearm"
## [95] "avg_pitch_forearm" "stddev_pitch_forearm"
## [97] "var_pitch_forearm" "avg_yaw_forearm"
## [99] "stddev_yaw_forearm" "var_yaw_forearm"
With the removal of variables with too many NAs, no near-zero-variance variables remain either:
nzv <- nearZeroVar(training, saveMetrics = TRUE)
print(nzv[nzv$nzv,])
## [1] freqRatio percentUnique zeroVar nzv
## <0 rows> (or 0-length row.names)
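To make the two statistics in that output easier to read, the sketch below computes them in base R for two toy vectors. nzv_metrics is a hypothetical helper written for this illustration, not caret's code. The cut-offs in the last lines (a frequency ratio above 95/5 and fewer than 10% unique values) are nearZeroVar's documented defaults.

```r
# Hypothetical helper returning the two statistics reported by nearZeroVar.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common one.
  # A constant column has no second value; returning Inf is a choice of this sketch.
  freq.ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Number of distinct values as a percentage of the number of observations.
  percent.unique <- 100 * length(counts) / length(x)
  c(freqRatio = freq.ratio, percentUnique = percent.unique)
}

mostly.constant <- c(rep(0, 98), 1, 2)  # one value dominates
varied <- 1:100                         # every value is different

m <- nzv_metrics(mostly.constant)
v <- nzv_metrics(varied)

# Apply the default cut-offs to flag near-zero-variance variables.
m[["freqRatio"]] > 95 / 5 && m[["percentUnique"]] < 10
v[["freqRatio"]] > 95 / 5 && v[["percentUnique"]] < 10
```

Only the first toy vector meets both conditions, which is the kind of column the check above is meant to catch.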
Boxplot for each numeric variable per classe: